Week 10: Cluster Analysis
Cluster analysis is a statistical technique that groups similar data points into clusters based on their values across a set of variables.
Applications
Market segmentation: A retail company might group customers into different groups based on buying behavior, demographics, or preferences to target marketing strategies accordingly.
Social Network Analysis: Cluster analysis can be used to detect groups of users with similar connections or interactions on social media platforms.
Recommender Systems: Streaming services like Netflix or Spotify use cluster analysis to group users with similar preferences, enabling the system to offer personalized recommendations.
Genetics and Bioinformatics: Cluster analysis is applied to genetic data to find groups of genes with similar expression patterns.
2D data can be clustered manually when visualized on a scatter plot.
How many clusters can you spot?
Manually clustering high-dimensional data is more challenging. Tools like a tour or a scatter plot matrix can help visualize it. How many clusters can you spot?
We want to group similar data points into clusters, but how do we determine if two data points are “similar” or “close”?
A common choice is Euclidean distance:
\[d(A, B) = \sqrt{\sum_{k=1}^{p}(x_{ak} - x_{bk})^2} = \sqrt{(\boldsymbol{x}_A - \boldsymbol{x}_B)^\top(\boldsymbol{x}_A - \boldsymbol{x}_B)}\] where \(\boldsymbol{x}_A\) and \(\boldsymbol{x}_B\) are length-\(p\) column vectors representing points A and B, respectively, and \(x_{ak}\) and \(x_{bk}\) denote the \(k\)-th elements of these vectors.
Examples
For point \(A\): \((0, 1)\) and point \(B\): \((2, 3)\), \(d(A, B) = \sqrt{(0 - 2)^2 + (1 - 3)^2} = 2\sqrt{2}.\)
For point \(A\): \((0, 1, 2)\) and point \(B\): \((2, 3, 4)\), \(d(A, B) = \sqrt{(0 - 2)^2 + (1 - 3)^2 + (2 - 4)^2} = 2\sqrt{3}.\)
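These calculations can be reproduced in base R; the vectors here are the points from the second example:

```r
A <- c(0, 1, 2)
B <- c(2, 3, 4)

# Direct use of the Euclidean distance formula
sqrt(sum((A - B)^2))   # 2 * sqrt(3), approximately 3.464

# Base R's dist() gives the same Euclidean distance
dist(rbind(A, B))
```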
Mahalanobis distance: \(\sqrt{(\boldsymbol{x}_A - \boldsymbol{x}_B)^\top S^{-1}(\boldsymbol{x}_A - \boldsymbol{x}_B)}\)
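As a quick numerical check, using the same covariance matrix that generates the plot below (variances 4.17 and 0.668), the Mahalanobis distances from A = (0, 0) to B = (5, 0) and to C = (0, 2) are nearly equal even though their Euclidean distances differ. Note that base R's mahalanobis() returns the squared distance, hence the sqrt():

```r
# Covariance matrix with variances 4.17 (x) and 0.668 (y)
S <- diag(c(4.17, 0.668))

# Mahalanobis distances from A = (0, 0)
sqrt(mahalanobis(c(5, 0), center = c(0, 0), cov = S))  # to B: ~2.45
sqrt(mahalanobis(c(0, 2), center = c(0, 0), cov = S))  # to C: ~2.45
sqrt(mahalanobis(c(2, 0), center = c(0, 0), cov = S))  # to D: ~0.98
```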
```r
ggplot() +
  geom_point(
    data = mvtnorm::rmvnorm(10000, mean = c(0, 0), sigma = diag(c(4.17, 0.668))) %>%
      as.data.frame(),
    aes(V1, V2), alpha = 0.02
  ) +
  ggforce::geom_ellipse(aes(x0 = 0, y0 = 0, a = 5, b = 2, angle = 0), linetype = 2) +
  annotate("point", x = 0, y = 0) +
  annotate("text", x = 0.5, y = 0.5, label = "A") +
  annotate("point", x = 5, y = 0) +
  annotate("text", x = 5.5, y = 0.5, label = "B") +
  annotate("point", x = 0, y = 2) +
  annotate("text", x = 0.5, y = 2.5, label = "C") +
  annotate("point", x = 2, y = 0) +
  annotate("text", x = 2.5, y = 0.5, label = "D") +
  coord_fixed() +
  theme_light() +
  scale_x_continuous(breaks = seq(-10, 10)) +
  ggtitle("D(A, B) = D(A, C) > D(A, D)") +
  xlab("X") +
  ylab("Y")
```

Manhattan distance: \(\sum_{k=1}^p|x_{ak} - x_{bk}|\)
```r
ggplot() +
  annotate("point", x = 0, y = 0) +
  annotate("text", x = 0.1, y = 0.1, label = "A") +
  annotate("point", x = 0, y = 2) +
  annotate("text", x = 0.1, y = 2.1, label = "C") +
  annotate("point", x = 1, y = 1) +
  annotate("text", x = 1.1, y = 1.1, label = "B") +
  annotate("point", x = 1, y = 2) +
  annotate("text", x = 1.1, y = 2.1, label = "D") +
  annotate("segment", x = 0, y = 0, xend = 0, yend = 1, linetype = 2) +
  annotate("segment", x = 0, y = 1, xend = 1, yend = 1, linetype = 2) +
  annotate("segment", x = 0, y = 1, xend = 0, yend = 2, linetype = 2) +
  annotate("segment", x = 1, y = 1, xend = 1, yend = 2, linetype = 2) +
  coord_fixed() +
  theme_light() +
  scale_x_continuous(breaks = seq(0, 1)) +
  scale_y_continuous(breaks = seq(0, 2)) +
  ggtitle("D(A, D) = 3 > D(A, B) = D(A, C) = 2") +
  xlab("X") +
  ylab("Y")
```

A distance metric \(d\) must satisfy these conditions:

- Non-negativity: \(d(A, B) \ge 0\)
- Identity: \(d(A, B) = 0\) if and only if \(A = B\)
- Symmetry: \(d(A, B) = d(B, A)\)
- Triangle inequality: \(d(A, B) \le d(A, C) + d(C, B)\)
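Base R's dist() supports both metrics via its method argument; A and D here are the points from the Manhattan plot above:

```r
A <- c(0, 0)
D <- c(1, 2)

sum(abs(A - D))                           # Manhattan distance: 3
dist(rbind(A, D), method = "manhattan")   # same, via dist()
dist(rbind(A, D), method = "euclidean")   # Euclidean: sqrt(5), approx. 2.24
```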
Can you come up with your own distance metric?
Tip
Many divergences used in statistics, like the Kullback–Leibler divergence, do not qualify as metrics.
Euclidean distance is unsuitable for categorical variables, and you’ll explore alternative distance metrics in ETC3250/ETC5250.
K-means is an iterative clustering algorithm which requires specifying the number of clusters (\(k\)) in advance.
The goal is to form \(k\) clusters that minimize within-cluster variance, expressed as:
\[\underset{\boldsymbol{S}}{\text{arg }\text{min}}\sum_{i=1}^{k}\underbrace{\sum_{\boldsymbol{x} \in S_i}\frac{1}{|S_i|}\lVert \boldsymbol{x} - \boldsymbol{\mu}_i\rVert^2}_{\text{variance of cluster } i},\]
where \(\boldsymbol{S} = \{S_1, S_2, \dots, S_k\}\) represents the \(k\) clusters, \(|S_i|\) is the number of points in cluster \(i\), and \(\boldsymbol{\mu}_i\) is the mean of the \(i\)-th cluster.
Smaller within-cluster variance indicates that the data points within each cluster are more tightly grouped together.
```r
set.seed(2024)
dat <- bind_rows(
  mvtnorm::rmvnorm(n = 100, mean = c(5, 5), sigma = matrix(c(1, -0.7, -0.7, 1), ncol = 2)) %>%
    as.data.frame(),
  mvtnorm::rmvnorm(n = 100, mean = c(10, 10), sigma = matrix(c(1, -0.5, -0.5, 1), ncol = 2)) %>%
    as.data.frame(),
  .id = "set"
) %>%
  mutate(cluster = c("Cluster 1", "Cluster 2")[as.integer(V2/V1 > 1) + 1])
dat %>%
  ggplot() +
  geom_point(aes(V1, V2, col = cluster)) +
  theme_light() +
  xlab("X") +
  ylab("Y") +
  coord_fixed() +
  labs(col = "") +
  ggtitle(dat %>%
            group_by(cluster) %>%
            summarise(var = var(V1) + var(V2)) %>%
            pull(var) %>%
            c(sum(.)) %>%
            format(digits = 3) %>%
            {glue::glue("Within-cluster variance: {.[1]} + {.[2]} = {.[3]}")})
```

```r
set.seed(2024)
dat <- bind_rows(
  mvtnorm::rmvnorm(n = 100, mean = c(5, 5), sigma = matrix(c(1, -0.7, -0.7, 1), ncol = 2)) %>%
    as.data.frame(),
  mvtnorm::rmvnorm(n = 100, mean = c(10, 10), sigma = matrix(c(1, -0.5, -0.5, 1), ncol = 2)) %>%
    as.data.frame(),
  .id = "set"
) %>%
  mutate(set = as.integer(set)) %>%
  mutate(cluster = c("Cluster 1", "Cluster 2")[set])
dat %>%
  ggplot() +
  geom_point(aes(V1, V2, col = cluster)) +
  theme_light() +
  xlab("X") +
  ylab("Y") +
  coord_fixed() +
  labs(col = "") +
  ggtitle(dat %>%
            group_by(cluster) %>%
            summarise(var = var(V1) + var(V2)) %>%
            pull(var) %>%
            c(sum(.)) %>%
            format(digits = 3) %>%
            {glue::glue("Within-cluster variance: {.[1]} + {.[2]} = {.[3]}")})
```

It is not feasible to enumerate every possible way to form \(k\) clusters. How many ways are there to partition \(n\) data points into \(k\) non-empty clusters?
\[S(n, k) = \frac{1}{k!}\sum_{i = 0}^{k}(-1)^{k-i}{k \choose i}i^{n},\]
the Stirling number of the second kind.
However, comparing two solutions is relatively straightforward (see previous slide).
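To see how quickly this count grows, the formula can be evaluated directly (a small sketch; the helper name stirling2 is my own):

```r
# Number of ways to partition n points into k non-empty clusters
# (Stirling number of the second kind)
stirling2 <- function(n, k) {
  i <- 0:k
  sum((-1)^(k - i) * choose(k, i) * i^n) / factorial(k)
}

stirling2(8, 2)    # 127: already many ways for just 8 points
stirling2(100, 3)  # astronomically large
```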
Finding the optimal k-means clustering is an NP-hard problem:
- The objective value of a candidate solution can be computed in polynomial time.
- Finding the optimal solution may require an exponential amount of time.
Since NP-hard problems cannot be solved exactly at realistic scales, we rely on approximation algorithms to find good solutions.
Let’s try to cluster 8 data points into 2 groups using Lloyd’s algorithm.
| obs | x | y |
|---|---|---|
| A | 8 | 2 |
| B | 2 | 4 |
| C | 7 | 2 |
| D | 10 | 4 |
| E | 5 | 5 |
| F | 3 | 10 |
| G | 1 | 10 |
| H | 1 | 7 |
Set \(k = 2\), and randomly divide the data into 2 groups.
| obs | x | y | cluster |
|---|---|---|---|
| A | 8 | 2 | 1 |
| B | 2 | 4 | 2 |
| C | 7 | 2 | 2 |
| D | 10 | 4 | 2 |
| E | 5 | 5 | 1 |
| F | 3 | 10 | 2 |
| G | 1 | 10 | 1 |
| H | 1 | 7 | 1 |
Compute the means for each group. \(\boldsymbol{\mu}_1 = (3.75, 6)\) and \(\boldsymbol{\mu}_2 = (5.5, 5)\).
| obs | x | y | cluster |
|---|---|---|---|
| A | 8 | 2 | 1 |
| B | 2 | 4 | 2 |
| C | 7 | 2 | 2 |
| D | 10 | 4 | 2 |
| E | 5 | 5 | 1 |
| F | 3 | 10 | 2 |
| G | 1 | 10 | 1 |
| H | 1 | 7 | 1 |
Find the nearest mean for each observation.
| obs | x | y | cluster | d1 | d2 | near |
|---|---|---|---|---|---|---|
| A | 8 | 2 | 1 | 5.84 | 3.91 | 2 |
| B | 2 | 4 | 2 | 2.66 | 3.64 | 1 |
| C | 7 | 2 | 2 | 5.15 | 3.35 | 2 |
| D | 10 | 4 | 2 | 6.56 | 4.61 | 2 |
| E | 5 | 5 | 1 | 1.60 | 0.50 | 2 |
| F | 3 | 10 | 2 | 4.07 | 5.59 | 1 |
| G | 1 | 10 | 1 | 4.85 | 6.73 | 1 |
| H | 1 | 7 | 1 | 2.93 | 4.92 | 1 |
Update the memberships.
| obs | x | y | cluster | d1 | d2 | near |
|---|---|---|---|---|---|---|
| A | 8 | 2 | 2 | 5.84 | 3.91 | 2 |
| B | 2 | 4 | 1 | 2.66 | 3.64 | 1 |
| C | 7 | 2 | 2 | 5.15 | 3.35 | 2 |
| D | 10 | 4 | 2 | 6.56 | 4.61 | 2 |
| E | 5 | 5 | 2 | 1.60 | 0.50 | 2 |
| F | 3 | 10 | 1 | 4.07 | 5.59 | 1 |
| G | 1 | 10 | 1 | 4.85 | 6.73 | 1 |
| H | 1 | 7 | 1 | 2.93 | 4.92 | 1 |
Recompute the means for each group. \(\boldsymbol{\mu}_1 = (1.75, 7.75)\) and \(\boldsymbol{\mu}_2 = (7.5, 3.25)\).
| obs | x | y | cluster |
|---|---|---|---|
| A | 8 | 2 | 2 |
| B | 2 | 4 | 1 |
| C | 7 | 2 | 2 |
| D | 10 | 4 | 2 |
| E | 5 | 5 | 2 |
| F | 3 | 10 | 1 |
| G | 1 | 10 | 1 |
| H | 1 | 7 | 1 |
Find the nearest mean for each observation. No observations need to be reassigned, so the algorithm has converged.
| obs | x | y | cluster | d1 | d2 | near |
|---|---|---|---|---|---|---|
| A | 8 | 2 | 2 | 8.49 | 1.35 | 2 |
| B | 2 | 4 | 1 | 3.76 | 5.55 | 1 |
| C | 7 | 2 | 2 | 7.79 | 1.35 | 2 |
| D | 10 | 4 | 2 | 9.06 | 2.61 | 2 |
| E | 5 | 5 | 2 | 4.26 | 3.05 | 2 |
| F | 3 | 10 | 1 | 2.57 | 8.11 | 1 |
| G | 1 | 10 | 1 | 2.37 | 9.37 | 1 |
| H | 1 | 7 | 1 | 1.06 | 7.50 | 1 |
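The same procedure is implemented in base R's kmeans(). A sketch applying it to the 8 points (dat8 is my own name, and starting the means from points A and F is my own choice to make the result deterministic):

```r
dat8 <- data.frame(
  x = c(8, 2, 7, 10, 5, 3, 1, 1),
  y = c(2, 4, 2, 4, 5, 10, 10, 7),
  row.names = c("A", "B", "C", "D", "E", "F", "G", "H")
)

# Lloyd's algorithm, initialized at points A and F
fit <- kmeans(dat8, centers = dat8[c("A", "F"), ], algorithm = "Lloyd")
fit$cluster  # cluster membership: {A, C, D, E} and {B, F, G, H}, as above
fit$centers  # the final cluster means: (7.5, 3.25) and (1.75, 7.75)
```

With a random initialization the algorithm can converge to a different local optimum, which is why the lecture recommends set.seed().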
Although we know the olive data comes from three regions, we typically do not have this information beforehand for other datasets.
Tip
Experiment with different values of \(k\).
Evaluate the solutions using plots of the data or a two-way table.
Be sure to use set.seed() since the results can depend on the initialization.
Switch cluster labels to ensure the largest values appear on the diagonal of the two-way table.
Is it working well? Try using different variables for x and y.
The algorithm tends to segment the data into roughly equal-sized, spherical clusters, so it works well when the true clusters are well separated and roughly spherical in shape.
If gaps appear within a single cluster, it suggests that k-means has failed to capture important underlying cluster structure.
From Interactively exploring high-dimensional data and models in R (Di Cook and Ursula Laa).
There is no correct number of clusters. Choosing the number of clusters depends on the context.
Do not choose too many clusters:
A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.
Do not choose too few clusters:
If you choose the 1-cluster solution there is no point in doing clustering at all.
The fpc package offers various cluster statistics that can assist you in comparing cluster solutions and determining the optimal number of clusters.
For more details, see ?cluster.stats().
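A sketch of how these statistics might be compared across candidate solutions (assuming the fpc package is installed; the data and object names are my own):

```r
library(fpc)

# Example data: two well-separated spherical groups
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
d <- dist(x)

# Compare k-means solutions for k = 2, ..., 5
for (k in 2:5) {
  km <- kmeans(x, centers = k, nstart = 10)
  cs <- cluster.stats(d, km$cluster)
  cat(k, cs$within.cluster.ss, cs$wb.ratio, cs$dunn, cs$ch, "\n")
}
```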
Some commonly used cluster statistics
- within.cluster.ss: Sum of squares within clusters. A lower value is better. This metric always decreases as more clusters are added, so focus on large drops.
- wb.ratio: The ratio of the average within-cluster distance to the average between-cluster distance. A lower value is preferable. This also decreases with more clusters, so look for significant drops.
- dunn: The minimum separation between clusters divided by the maximum cluster diameter. A higher value is better.
- ch: The Calinski-Harabasz index, which should also be high for a good clustering.

There are two types of hierarchical clustering: agglomerative (bottom-up, starting with each observation in its own cluster and merging) and divisive (top-down, starting with one cluster and splitting).
Hierarchical clustering is a recursive algorithm: start with each observation in its own cluster, repeatedly merge the two closest clusters, and stop when all observations belong to a single cluster.
Distance between sets
We know how to calculate the distance between two individual points, but how do we compute the distance between a point and a set of points once they are merged?
For instance, how do we calculate the distance between point A and the set (B, C)? Or between the sets (A, C) and (B, D)?
Linkage methods can be used to measure dissimilarity between sets of observations:
- Images from ETC3250/ETC5250 lecture 9.
Hierarchical clustering begins by computing the distance matrix for the dataset.
Select an appropriate distance metric at this stage. Here, we are using Euclidean distance.
| A | B | C | D | E | F | G | H | |
|---|---|---|---|---|---|---|---|---|
| A | 0.00 | 6.32 | 1.00 | 2.83 | 4.24 | 9.43 | 10.63 | 8.60 |
| B | 6.32 | 0.00 | 5.39 | 8.00 | 3.16 | 6.08 | 6.08 | 3.16 |
| C | 1.00 | 5.39 | 0.00 | 3.61 | 3.61 | 8.94 | 10.00 | 7.81 |
| D | 2.83 | 8.00 | 3.61 | 0.00 | 5.10 | 9.22 | 10.82 | 9.49 |
| E | 4.24 | 3.16 | 3.61 | 5.10 | 0.00 | 5.39 | 6.40 | 4.47 |
| F | 9.43 | 6.08 | 8.94 | 9.22 | 5.39 | 0.00 | 2.00 | 3.61 |
| G | 10.63 | 6.08 | 10.00 | 10.82 | 6.40 | 2.00 | 0.00 | 3.00 |
| H | 8.60 | 3.16 | 7.81 | 9.49 | 4.47 | 3.61 | 3.00 | 0.00 |
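This matrix can be reproduced with dist() (dat8 is my own name for the 8 observations):

```r
dat8 <- data.frame(
  x = c(8, 2, 7, 10, 5, 3, 1, 1),
  y = c(2, 4, 2, 4, 5, 10, 10, 7),
  row.names = c("A", "B", "C", "D", "E", "F", "G", "H")
)

round(dist(dat8), 2)  # Euclidean distances, as in the table above
```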
Then choose a linkage method and find the two closest data points. Here we use single linkage.
| A | B | C | D | E | F | G | H | |
|---|---|---|---|---|---|---|---|---|
| A | 0.00 | 6.32 | 1.00 | 2.83 | 4.24 | 9.43 | 10.63 | 8.60 |
| B | 6.32 | 0.00 | 5.39 | 8.00 | 3.16 | 6.08 | 6.08 | 3.16 |
| C | 1.00 | 5.39 | 0.00 | 3.61 | 3.61 | 8.94 | 10.00 | 7.81 |
| D | 2.83 | 8.00 | 3.61 | 0.00 | 5.10 | 9.22 | 10.82 | 9.49 |
| E | 4.24 | 3.16 | 3.61 | 5.10 | 0.00 | 5.39 | 6.40 | 4.47 |
| F | 9.43 | 6.08 | 8.94 | 9.22 | 5.39 | 0.00 | 2.00 | 3.61 |
| G | 10.63 | 6.08 | 10.00 | 10.82 | 6.40 | 2.00 | 0.00 | 3.00 |
| H | 8.60 | 3.16 | 7.81 | 9.49 | 4.47 | 3.61 | 3.00 | 0.00 |
Update the distance matrix using single linkage.
| (A,C) | B | D | E | F | G | H | |
|---|---|---|---|---|---|---|---|
| (A,C) | 0.00 | 5.39 | 2.83 | 3.61 | 8.94 | 10.00 | 7.81 |
| B | 5.39 | 0.00 | 8.00 | 3.16 | 6.08 | 6.08 | 3.16 |
| D | 2.83 | 8.00 | 0.00 | 5.10 | 9.22 | 10.82 | 9.49 |
| E | 3.61 | 3.16 | 5.10 | 0.00 | 5.39 | 6.40 | 4.47 |
| F | 8.94 | 6.08 | 9.22 | 5.39 | 0.00 | 2.00 | 3.61 |
| G | 10.00 | 6.08 | 10.82 | 6.40 | 2.00 | 0.00 | 3.00 |
| H | 7.81 | 3.16 | 9.49 | 4.47 | 3.61 | 3.00 | 0.00 |
Find the two closest data points.
| (A,C) | B | D | E | F | G | H | |
|---|---|---|---|---|---|---|---|
| (A,C) | 0.00 | 5.39 | 2.83 | 3.61 | 8.94 | 10.00 | 7.81 |
| B | 5.39 | 0.00 | 8.00 | 3.16 | 6.08 | 6.08 | 3.16 |
| D | 2.83 | 8.00 | 0.00 | 5.10 | 9.22 | 10.82 | 9.49 |
| E | 3.61 | 3.16 | 5.10 | 0.00 | 5.39 | 6.40 | 4.47 |
| F | 8.94 | 6.08 | 9.22 | 5.39 | 0.00 | 2.00 | 3.61 |
| G | 10.00 | 6.08 | 10.82 | 6.40 | 2.00 | 0.00 | 3.00 |
| H | 7.81 | 3.16 | 9.49 | 4.47 | 3.61 | 3.00 | 0.00 |
Update the distance matrix using single linkage.
| (A,C) | B | D | E | (F,G) | H | |
|---|---|---|---|---|---|---|
| (A,C) | 0.00 | 5.39 | 2.83 | 3.61 | 8.94 | 7.81 |
| B | 5.39 | 0.00 | 8.00 | 3.16 | 6.08 | 3.16 |
| D | 2.83 | 8.00 | 0.00 | 5.10 | 9.22 | 9.49 |
| E | 3.61 | 3.16 | 5.10 | 0.00 | 5.39 | 4.47 |
| (F,G) | 8.94 | 6.08 | 9.22 | 5.39 | 0.00 | 3.00 |
| H | 7.81 | 3.16 | 9.49 | 4.47 | 3.00 | 0.00 |
Important
We can cut the tree to partition the data into \(k\) clusters.
This allows us to obtain different clustering solutions from a single tree!
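The merge sequence above, and the cutting step, can be reproduced with base R (dat8 is my own name for the 8 observations):

```r
dat8 <- data.frame(
  x = c(8, 2, 7, 10, 5, 3, 1, 1),
  y = c(2, 4, 2, 4, 5, 10, 10, 7),
  row.names = c("A", "B", "C", "D", "E", "F", "G", "H")
)

hc <- hclust(dist(dat8), method = "single")  # single linkage
hc$merge           # first merges: (A, C), then (F, G), as above
cutree(hc, k = 2)  # cut the tree into k = 2 clusters
```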
One criterion for stability is that the number of clusters remains consistent over a wide range of tolerance levels.
In general, look for a long stretch of tolerance where the number of clusters does not change.
Tolerance level
Consider the axis representing distance as a measure of “tolerance level”.
When the distance between two clusters falls within this tolerance, they are merged into a single cluster.
As the tolerance increases, more clusters are merged, resulting in fewer overall clusters.
Read this chapter about common patterns that will confuse clustering algorithms.
Hierarchical clustering requires 4 steps:
1. Standardize the variables if needed (scale()).
2. Compute the distance matrix (dist()).
3. Build the tree (hclust()) and visualize it (ggdendrogram()).
4. Cut the tree to obtain cluster memberships.

Remember to profile the clusters and to provide insight into what these clusters may represent.
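A sketch of these steps on the built-in USArrests data (cutree() is assumed here as the tree-cutting function, and ggdendro::ggdendrogram() is noted as the visualization alternative):

```r
d  <- dist(scale(USArrests))  # scale the variables, then compute distances
hc <- hclust(d)               # build the tree...
plot(hc)                      # ...and visualize it (or ggdendro::ggdendrogram(hc))
memb <- cutree(hc, k = 4)     # cut the tree into 4 clusters
table(memb)                   # profile the cluster sizes
```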
ETC1010/ETC5510 Lecture 10 | Melbourne time